ENGSCI 712 HAR Assignment¶
Introduction¶
This experiment aims to classify walking and running activites by extracting relevant features from time-series data, and fitting a Random Forest Classifier to these features.
Experiment¶
The linear acceleration in the x, y and z directions and the total acceleration of the subject was recorded using the 'Physics Toolbox' app on an iPhone 15 Pro.
The experiment was conducted over 6 different speeds on a treadmill, and were labelled as either walking or running:
- Walking: 2 km/h, 3 km/h, 4 km/h.
- Running: 8 km/h, 10 km/h, 12 km/h.
Each sample was recorded over approximately three minutes, with the iPhone held in the subject’s right hand. A standard running technique was used, with the arms positioned roughly perpendicular to the ground, swinging back and forth. Similarly, the walking technique followed a typical form, with the arms relaxed by the sides and swinging naturally back and forth.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import pickle
from tsfresh.transformers import FeatureSelector
from tsfresh.feature_extraction import extract_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, RepeatedKFold
from pathlib import Path
warnings.filterwarnings(action='ignore')
%matplotlib inline
Directory Paths¶
data_dir = Path('..') / 'data'
fig_dir = Path('..') / 'reports' / 'figures'
model_dir = Path('..') / 'models'
Loading the Raw Data¶
two_kmph_df = pd.read_csv('../data/raw/acceleration-2kmph-walking.csv', index_col='time')
three_kmph_df = pd.read_csv('../data/raw/acceleration-3kmph-walking.csv', index_col='time')
four_kmph_df = pd.read_csv('../data/raw/acceleration-4kmph-walking.csv', index_col='time')
eight_kmph_df = pd.read_csv('../data/raw/acceleration-8kmph-running.csv', index_col='time')
ten_kmph_df = pd.read_csv('../data/raw/acceleration-10kmph-running.csv', index_col='time')
twelve_kmph_df = pd.read_csv('../data/raw/acceleration-12kmph-running.csv', index_col='time')
two_kmph_df.head()
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.00 | 0.22 | -0.54 | 0.48 | 0.75 |
| 0.01 | -0.07 | -0.72 | 0.23 | 0.76 |
| 0.02 | -0.46 | -0.75 | -0.32 | 0.94 |
| 0.03 | -0.77 | -0.61 | -0.47 | 1.09 |
| 0.04 | -0.79 | -0.44 | -0.36 | 0.97 |
Visualisation of Raw Data¶
Comparison of acceleration in the x-direction¶
all_speeds = [two_kmph_df, three_kmph_df, four_kmph_df, eight_kmph_df, ten_kmph_df, twelve_kmph_df] # List of all samples.
all_names = ['2_kmph', '3_kmph', '4_kmph', '8_kmph', '10_kmph', '12_kmph']
column_labels = two_kmph_df.columns
fig_ax, ax_x = plt.subplots(6, sharex=True)
fig_ax.set_figwidth(12)
fig_ax.set_figheight(22)
for i in range(len(ax_x)):
ax_x[i].set_title(f'{all_names[i]}_ax')
ax_x[i].set_xlabel('Time/s')
ax_x[i].set_ylabel('Acceleration/ms-2')
ax_x[i].plot(all_speeds[i].index, all_speeds[i]['ax'])
plt.savefig(fig_dir / 'ax_raw_fig.png')
plt.show()
Comparison of acceleration in the y-direction¶
fig_ay, ax_y = plt.subplots(6, sharex=True)
fig_ay.set_figwidth(12)
fig_ay.set_figheight(22)
for i in range(len(ax_y)):
ax_y[i].set_title(f'{all_names[i]}_ay')
ax_y[i].set_xlabel('Time/s')
ax_y[i].set_ylabel('Acceleration/ms-2')
ax_y[i].plot(all_speeds[i].index, all_speeds[i]['ay'])
plt.savefig(fig_dir / 'ay_raw_fig.png')
plt.show()
Comparison of acceleration in the z-direction¶
fig_az, ax_z = plt.subplots(6, sharex=True)
fig_az.set_figwidth(12)
fig_az.set_figheight(22)
for i in range(len(ax_z)):
ax_z[i].set_title(f'{all_names[i]}_az')
ax_z[i].set_xlabel('Time/s')
ax_z[i].set_ylabel('Acceleration/ms-2')
ax_z[i].plot(all_speeds[i].index, all_speeds[i]['az'])
plt.savefig(fig_dir / 'az_raw_fig.png')
plt.show()
Comparison of total acceleration¶
fig_atotal, ax_total = plt.subplots(6, sharex=True)
fig_atotal.set_figwidth(12)
fig_atotal.set_figheight(22)
for i in range(len(ax_total)):
ax_total[i].set_title(f'{all_names[i]}_atotal')
ax_total[i].set_xlabel('Time/s')
ax_total[i].set_ylabel('Acceleration/ms-2')
ax_total[i].plot(all_speeds[i].index, all_speeds[i]['atotal'])
plt.savefig(fig_dir / 'atotal_raw_fig.png')
plt.show()
Observations¶
There is a general increase in the amplitude and frequency of the signals as the speed of the activity increases. This is expected because a faster walk/run would result in faster swinging of the arms.
The duration of each sample was defined by noticeable spikes in signal amplitude at both the beginning and end, particularly in the z-direction. These sudden movements were intentionally introduced at the start and end of each walking or running activity to clearly mark the sample boundaries.
Preprocessing¶
The raw data time column was already formatted as a 'delta_t" column so creating this column was not necessary.
Removal of transition periods¶
The transitions periods were found by searching through the dataframes, and removing the outlier observations at the start and end.
pruned_two_kmph_df = two_kmph_df[(two_kmph_df.index >= 1.949) & (two_kmph_df.index <= 166.494)] # Pruning
pruned_two_kmph_df.index -= pruned_two_kmph_df.index[0] # Resetting index to start at time = 0.000
pruned_two_kmph_df
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.000 | 6.04 | 0.33 | 1.03 | 6.14 |
| 0.010 | 5.76 | -0.20 | 0.97 | 5.84 |
| 0.020 | 5.57 | -0.44 | 1.10 | 5.69 |
| 0.030 | 5.37 | -0.93 | 0.93 | 5.53 |
| 0.040 | 5.07 | -1.66 | 0.79 | 5.39 |
| ... | ... | ... | ... | ... |
| 164.505 | 4.82 | -1.04 | 1.53 | 5.17 |
| 164.515 | 4.89 | -1.50 | 1.84 | 5.44 |
| 164.525 | 5.22 | -2.11 | 1.97 | 5.97 |
| 164.535 | 5.60 | -2.39 | 2.15 | 6.45 |
| 164.545 | 5.92 | -2.66 | 2.41 | 6.92 |
16463 rows × 4 columns
pruned_three_kmph_df = three_kmph_df[(three_kmph_df.index >= 0.880) & (three_kmph_df.index <= 178.542)]
pruned_three_kmph_df.index -= pruned_three_kmph_df.index[0]
pruned_three_kmph_df
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.000 | 4.10 | 3.34 | 2.50 | 5.85 |
| 0.010 | 4.29 | 4.09 | 1.73 | 6.18 |
| 0.020 | 4.37 | 4.65 | 0.98 | 6.45 |
| 0.030 | 4.46 | 4.88 | 0.11 | 6.61 |
| 0.040 | 4.33 | 4.84 | -0.59 | 6.52 |
| ... | ... | ... | ... | ... |
| 177.622 | 5.02 | 0.21 | 1.00 | 5.12 |
| 177.632 | 5.08 | 0.04 | 1.24 | 5.23 |
| 177.642 | 5.34 | -0.71 | 1.61 | 5.62 |
| 177.652 | 5.59 | -1.35 | 2.02 | 6.09 |
| 177.662 | 6.18 | -2.02 | 2.77 | 7.07 |
17772 rows × 4 columns
pruned_four_kmph_df = four_kmph_df[(four_kmph_df.index >= 1.590) & (four_kmph_df.index <= 178.856)]
pruned_four_kmph_df.index -= pruned_four_kmph_df.index[0]
pruned_four_kmph_df
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.000 | -5.37 | -7.04 | -2.20 | 9.13 |
| 0.009 | -4.03 | -6.25 | -1.15 | 7.52 |
| 0.020 | -2.37 | -5.30 | 0.06 | 5.81 |
| 0.029 | -1.08 | -4.49 | 1.06 | 4.74 |
| 0.040 | -0.99 | -4.37 | 1.20 | 4.64 |
| ... | ... | ... | ... | ... |
| 177.226 | 7.05 | 3.15 | 1.12 | 7.80 |
| 177.236 | 7.09 | 3.13 | 1.33 | 7.86 |
| 177.247 | 7.18 | 3.07 | 1.55 | 7.96 |
| 177.256 | 7.42 | 3.08 | 1.94 | 8.26 |
| 177.266 | 7.77 | 3.13 | 2.50 | 8.74 |
17733 rows × 4 columns
pruned_eight_kmph_df = eight_kmph_df[(eight_kmph_df.index >= 0.710) & (eight_kmph_df.index <= 178.635)]
pruned_eight_kmph_df.index -= pruned_eight_kmph_df.index[0]
pruned_eight_kmph_df
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.000 | 3.40 | 4.45 | 8.10 | 9.85 |
| 0.010 | 1.73 | 4.43 | 4.86 | 6.79 |
| 0.020 | 0.35 | 4.27 | 2.24 | 4.83 |
| 0.030 | -0.13 | 3.88 | 0.81 | 3.96 |
| 0.040 | 0.36 | 3.48 | 0.15 | 3.50 |
| ... | ... | ... | ... | ... |
| 177.885 | 0.17 | -0.11 | -1.88 | 1.89 |
| 177.895 | -0.76 | 2.07 | 3.42 | 4.07 |
| 177.905 | -2.55 | 4.68 | 5.79 | 7.87 |
| 177.915 | -4.01 | 7.83 | 6.26 | 10.79 |
| 177.925 | -4.73 | 10.97 | 7.02 | 13.86 |
17799 rows × 4 columns
pruned_ten_kmph_df = ten_kmph_df[(ten_kmph_df.index >= 0.380) & (ten_kmph_df.index <= 179.094)]
pruned_ten_kmph_df.index -= pruned_ten_kmph_df.index[0]
pruned_ten_kmph_df
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.000 | 4.82 | 2.04 | 10.23 | 11.49 |
| 0.010 | 3.50 | 1.42 | 7.48 | 8.38 |
| 0.020 | 2.37 | 0.49 | 4.54 | 5.15 |
| 0.030 | 1.47 | -0.80 | 2.18 | 2.75 |
| 0.040 | 1.17 | -1.40 | 1.19 | 2.18 |
| ... | ... | ... | ... | ... |
| 178.674 | -8.75 | -2.17 | -0.78 | 9.04 |
| 178.684 | -9.40 | -0.62 | -0.03 | 9.42 |
| 178.694 | -10.02 | 1.27 | 1.31 | 10.18 |
| 178.704 | -9.89 | 4.08 | 3.96 | 11.41 |
| 178.714 | -6.29 | 7.13 | 8.02 | 12.44 |
17878 rows × 4 columns
pruned_twelve_kmph_df = twelve_kmph_df[(twelve_kmph_df.index >= 0.580) & (twelve_kmph_df.index <= 179.242)]
pruned_twelve_kmph_df.index -= pruned_twelve_kmph_df.index[0] # Reset index.
pruned_twelve_kmph_df
| ax | ay | az | atotal | |
|---|---|---|---|---|
| time | ||||
| 0.000 | 11.49 | 5.63 | 19.67 | 23.46 |
| 0.010 | 9.19 | 3.58 | 15.33 | 18.23 |
| 0.020 | 7.17 | 2.00 | 10.68 | 13.02 |
| 0.030 | 5.54 | 0.32 | 7.13 | 9.04 |
| 0.040 | 3.96 | -1.35 | 4.88 | 6.43 |
| ... | ... | ... | ... | ... |
| 178.622 | -6.79 | 2.00 | 0.80 | 7.12 |
| 178.632 | -6.79 | 2.23 | 1.78 | 7.37 |
| 178.642 | -6.68 | 3.70 | 3.64 | 8.46 |
| 178.652 | -4.83 | 6.53 | 7.26 | 10.90 |
| 178.662 | 1.25 | 10.46 | 14.18 | 17.67 |
17873 rows × 4 columns
Accelerations in the x-direction after pruning¶
all_speeds_pruned = [pruned_two_kmph_df, pruned_three_kmph_df, pruned_four_kmph_df, pruned_eight_kmph_df, pruned_ten_kmph_df, pruned_twelve_kmph_df]
fig_ax_pruned, ax_x_pruned = plt.subplots(6, sharex=True)
fig_ax_pruned.set_figwidth(12)
fig_ax_pruned.set_figheight(22)
for i in range(len(ax_x)):
ax_x_pruned[i].set_title(f'{all_names[i]}_ax')
ax_x_pruned[i].set_xlabel('Time/s')
ax_x_pruned[i].set_ylabel('Acceleration/ms-2')
ax_x_pruned[i].plot(all_speeds_pruned[i].index, all_speeds_pruned[i]['ax'])
plt.savefig(fig_dir / 'ax_pruned_fig.png')
plt.show()
Accelerations in the y-direction after pruning¶
fig_ay_pruned, ax_y_pruned = plt.subplots(6, sharex=True)
fig_ay_pruned.set_figwidth(12)
fig_ay_pruned.set_figheight(22)
for i in range(len(ax_x)):
ax_y_pruned[i].set_title(f'{all_names[i]}_ay')
ax_y_pruned[i].set_xlabel('Time/s')
ax_y_pruned[i].set_ylabel('Acceleration/ms-2')
ax_y_pruned[i].plot(all_speeds_pruned[i].index, all_speeds_pruned[i]['ay'])
plt.savefig(fig_dir / 'ay_pruned_fig.png')
plt.show()
Accelerations in the z-direction after pruning¶
fig_az_pruned, ax_z_pruned = plt.subplots(6, sharex=True)
fig_az_pruned.set_figwidth(12)
fig_az_pruned.set_figheight(22)
for i in range(len(ax_x)):
ax_z_pruned[i].set_title(f'{all_names[i]}_az')
ax_z_pruned[i].set_xlabel('Time/s')
ax_z_pruned[i].set_ylabel('Acceleration/ms-2')
ax_z_pruned[i].plot(all_speeds_pruned[i].index, all_speeds_pruned[i]['az'])
plt.savefig(fig_dir / 'az_pruned_fig.png')
plt.show()
Total Accelerations after pruning¶
fig_atotal_pruned, ax_total_pruned = plt.subplots(6, sharex=True)
fig_atotal_pruned.set_figwidth(12)
fig_atotal_pruned.set_figheight(22)
for i in range(len(ax_x)):
ax_total_pruned[i].set_title(f'{all_names[i]}_atotal')
ax_total_pruned[i].set_xlabel('Time/s')
ax_total_pruned[i].set_ylabel('Acceleration/ms-2')
ax_total_pruned[i].plot(all_speeds_pruned[i].index, all_speeds_pruned[i]['atotal'])
plt.savefig(fig_dir / 'atotal_pruned_fig.png')
plt.show()
for i in range(len(all_speeds_pruned)):
print(f'{all_names[i]} signal length in seconds: {len(all_speeds_pruned[i]) / 100}')
2_kmph signal length in seconds: 164.63 3_kmph signal length in seconds: 177.72 4_kmph signal length in seconds: 177.33 8_kmph signal length in seconds: 177.99 10_kmph signal length in seconds: 178.78 12_kmph signal length in seconds: 178.73
The lengths of each signal are approximately equivalent, so there shouldn't be any concerns with an unbalanced dataset.
Extraction of Features¶
Window Indexing¶
'w' will represent walking, and 'r' will represent running in the window indices.
# Creating window indices.
idx_two = np.asarray(np.floor(pruned_two_kmph_df.reset_index().index / 100).values, int)
idx_three = np.asarray(np.floor(pruned_three_kmph_df.reset_index().index / 100).values + idx_two[-1] + 1, int)
idx_four = np.asarray(np.floor(pruned_four_kmph_df.reset_index().index / 100).values + idx_three[-1] + 1, int)
# Prefixing with 'w' which stands for walking.
pruned_two_kmph_df['window_idx'] = ["w{:03}".format(i) for i in idx_two]
pruned_three_kmph_df['window_idx'] = ["w{:03}".format(i) for i in idx_three]
pruned_four_kmph_df['window_idx'] = ["w{:03}".format(i) for i in idx_four]
pruned_two_kmph_df.to_csv(data_dir / "interim" / "pruned_two_kmph.csv", index=False)
pruned_three_kmph_df.to_csv(data_dir / "interim" / "pruned_three_kmph.csv", index=False)
pruned_four_kmph_df.to_csv(data_dir / "interim" / "pruned_four_kmph.csv", index=False)
pruned_two_kmph_df
| ax | ay | az | atotal | window_idx | |
|---|---|---|---|---|---|
| time | |||||
| 0.000 | 6.04 | 0.33 | 1.03 | 6.14 | w000 |
| 0.010 | 5.76 | -0.20 | 0.97 | 5.84 | w000 |
| 0.020 | 5.57 | -0.44 | 1.10 | 5.69 | w000 |
| 0.030 | 5.37 | -0.93 | 0.93 | 5.53 | w000 |
| 0.040 | 5.07 | -1.66 | 0.79 | 5.39 | w000 |
| ... | ... | ... | ... | ... | ... |
| 164.505 | 4.82 | -1.04 | 1.53 | 5.17 | w164 |
| 164.515 | 4.89 | -1.50 | 1.84 | 5.44 | w164 |
| 164.525 | 5.22 | -2.11 | 1.97 | 5.97 | w164 |
| 164.535 | 5.60 | -2.39 | 2.15 | 6.45 | w164 |
| 164.545 | 5.92 | -2.66 | 2.41 | 6.92 | w164 |
16463 rows × 5 columns
idx_eight = np.asarray(np.floor(pruned_eight_kmph_df.reset_index().index / 100).values, int)
idx_ten = np.asarray(np.floor(pruned_ten_kmph_df.reset_index().index / 100).values + idx_eight[-1] + 1, int)
idx_twelve = np.asarray(np.floor(pruned_twelve_kmph_df.reset_index().index / 100).values + idx_ten[-1] + 1, int)
# Prepend 'r' to window_idx to represent running..
pruned_eight_kmph_df['window_idx'] = ["r{:03}".format(i) for i in idx_eight]
pruned_ten_kmph_df['window_idx'] = ["r{:03}".format(i) for i in idx_ten]
pruned_twelve_kmph_df['window_idx'] = ["r{:03}".format(i) for i in idx_twelve]
pruned_eight_kmph_df.to_csv(data_dir / "interim" / "pruned_eight_kmph.csv", index=False)
pruned_ten_kmph_df.to_csv(data_dir / "interim" / "pruned_ten_kmph.csv", index=False)
pruned_twelve_kmph_df.to_csv(data_dir / "interim" / "pruned_twelve_kmph.csv", index=False)
pruned_eight_kmph_df
| ax | ay | az | atotal | window_idx | |
|---|---|---|---|---|---|
| time | |||||
| 0.000 | 3.40 | 4.45 | 8.10 | 9.85 | r000 |
| 0.010 | 1.73 | 4.43 | 4.86 | 6.79 | r000 |
| 0.020 | 0.35 | 4.27 | 2.24 | 4.83 | r000 |
| 0.030 | -0.13 | 3.88 | 0.81 | 3.96 | r000 |
| 0.040 | 0.36 | 3.48 | 0.15 | 3.50 | r000 |
| ... | ... | ... | ... | ... | ... |
| 177.885 | 0.17 | -0.11 | -1.88 | 1.89 | r177 |
| 177.895 | -0.76 | 2.07 | 3.42 | 4.07 | r177 |
| 177.905 | -2.55 | 4.68 | 5.79 | 7.87 | r177 |
| 177.915 | -4.01 | 7.83 | 6.26 | 10.79 | r177 |
| 177.925 | -4.73 | 10.97 | 7.02 | 13.86 | r177 |
17799 rows × 5 columns
Concatenating dataframes¶
walking_df = pd.concat([pruned_two_kmph_df, pruned_three_kmph_df, pruned_four_kmph_df])
walking_df.to_csv(data_dir / "interim" / "walking.csv", index=False)
walking_df
| ax | ay | az | atotal | window_idx | |
|---|---|---|---|---|---|
| time | |||||
| 0.000 | 6.04 | 0.33 | 1.03 | 6.14 | w000 |
| 0.010 | 5.76 | -0.20 | 0.97 | 5.84 | w000 |
| 0.020 | 5.57 | -0.44 | 1.10 | 5.69 | w000 |
| 0.030 | 5.37 | -0.93 | 0.93 | 5.53 | w000 |
| 0.040 | 5.07 | -1.66 | 0.79 | 5.39 | w000 |
| ... | ... | ... | ... | ... | ... |
| 177.226 | 7.05 | 3.15 | 1.12 | 7.80 | w520 |
| 177.236 | 7.09 | 3.13 | 1.33 | 7.86 | w520 |
| 177.247 | 7.18 | 3.07 | 1.55 | 7.96 | w520 |
| 177.256 | 7.42 | 3.08 | 1.94 | 8.26 | w520 |
| 177.266 | 7.77 | 3.13 | 2.50 | 8.74 | w520 |
51968 rows × 5 columns
running_df = pd.concat([pruned_eight_kmph_df, pruned_ten_kmph_df, pruned_twelve_kmph_df])
running_df.to_csv(data_dir / "interim" / "running.csv", index=False)
running_df
| ax | ay | az | atotal | window_idx | |
|---|---|---|---|---|---|
| time | |||||
| 0.000 | 3.40 | 4.45 | 8.10 | 9.85 | r000 |
| 0.010 | 1.73 | 4.43 | 4.86 | 6.79 | r000 |
| 0.020 | 0.35 | 4.27 | 2.24 | 4.83 | r000 |
| 0.030 | -0.13 | 3.88 | 0.81 | 3.96 | r000 |
| 0.040 | 0.36 | 3.48 | 0.15 | 3.50 | r000 |
| ... | ... | ... | ... | ... | ... |
| 178.622 | -6.79 | 2.00 | 0.80 | 7.12 | r535 |
| 178.632 | -6.79 | 2.23 | 1.78 | 7.37 | r535 |
| 178.642 | -6.68 | 3.70 | 3.64 | 8.46 | r535 |
| 178.652 | -4.83 | 6.53 | 7.26 | 10.90 | r535 |
| 178.662 | 1.25 | 10.46 | 14.18 | 17.67 | r535 |
53550 rows × 5 columns
There are 51968 rows combined in the walking dataframe, and 53550 rows combined in the running dataframe.
walking_running_df = pd.concat([walking_df, running_df])
walking_running_df.to_csv(data_dir / "interim" / "walking_running.csv", index=False)
walking_running_df
| ax | ay | az | atotal | window_idx | |
|---|---|---|---|---|---|
| time | |||||
| 0.000 | 6.04 | 0.33 | 1.03 | 6.14 | w000 |
| 0.010 | 5.76 | -0.20 | 0.97 | 5.84 | w000 |
| 0.020 | 5.57 | -0.44 | 1.10 | 5.69 | w000 |
| 0.030 | 5.37 | -0.93 | 0.93 | 5.53 | w000 |
| 0.040 | 5.07 | -1.66 | 0.79 | 5.39 | w000 |
| ... | ... | ... | ... | ... | ... |
| 178.622 | -6.79 | 2.00 | 0.80 | 7.12 | r535 |
| 178.632 | -6.79 | 2.23 | 1.78 | 7.37 | r535 |
| 178.642 | -6.68 | 3.70 | 3.64 | 8.46 | r535 |
| 178.652 | -4.83 | 6.53 | 7.26 | 10.90 | r535 |
| 178.662 | 1.25 | 10.46 | 14.18 | 17.67 | r535 |
105518 rows × 5 columns
The combined walking and running dataframes have 105518 rows in total.
walking_running_df['window_idx'].nunique()
1057
There are 1057 unique window indices.
Extraction of Features¶
extracted_features = extract_features(walking_running_df, column_id='window_idx')
Feature Extraction: 100%|██████████| 20/20 [02:04<00:00, 6.23s/it]
impute(extracted_features) # Impute NA/null values.
extracted_features.to_csv(data_dir / "interim" / "extracted_features.csv", index=False)
extracted_features
| ax__variance_larger_than_standard_deviation | ax__has_duplicate_max | ax__has_duplicate_min | ax__has_duplicate | ax__sum_values | ax__abs_energy | ax__mean_abs_change | ax__mean_change | ax__mean_second_derivative_central | ax__median | ... | atotal__fourier_entropy__bins_5 | atotal__fourier_entropy__bins_10 | atotal__fourier_entropy__bins_100 | atotal__permutation_entropy__dimension_3__tau_1 | atotal__permutation_entropy__dimension_4__tau_1 | atotal__permutation_entropy__dimension_5__tau_1 | atotal__permutation_entropy__dimension_6__tau_1 | atotal__permutation_entropy__dimension_7__tau_1 | atotal__query_similarity_count__query_None__threshold_0.0 | atotal__mean_n_absolute_max__number_of_maxima_7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r000 | 1.0 | 0.0 | 0.0 | 1.0 | 86.13 | 1655.8011 | 0.681111 | 0.039495 | 0.006633 | -0.075 | ... | 0.575228 | 0.706253 | 1.283294 | 1.374172 | 2.122390 | 2.857326 | 3.476007 | 3.990108 | 0.0 | 12.088571 |
| r001 | 1.0 | 0.0 | 0.0 | 1.0 | -66.64 | 5067.8256 | 1.277879 | -0.083131 | 0.016633 | -2.755 | ... | 0.369812 | 0.508382 | 1.266372 | 1.165692 | 1.682566 | 2.242483 | 2.781797 | 3.209318 | 0.0 | 23.260000 |
| r002 | 1.0 | 0.0 | 0.0 | 1.0 | 72.64 | 2051.5922 | 0.859192 | 0.020808 | -0.011020 | 0.115 | ... | 0.481199 | 0.668811 | 1.103096 | 1.144288 | 1.580317 | 2.029087 | 2.486393 | 2.894001 | 0.0 | 15.138571 |
| r003 | 1.0 | 0.0 | 0.0 | 1.0 | 1.78 | 1935.8234 | 0.776869 | 0.003535 | 0.003214 | -0.855 | ... | 0.413917 | 0.508382 | 1.318528 | 1.168784 | 1.649243 | 2.143991 | 2.615653 | 3.061648 | 0.0 | 15.002857 |
| r004 | 1.0 | 0.0 | 0.0 | 1.0 | 70.12 | 2665.7300 | 0.916970 | -0.037980 | -0.003929 | -0.940 | ... | 0.464277 | 0.518641 | 0.919223 | 1.116646 | 1.566289 | 1.959137 | 2.364171 | 2.763991 | 0.0 | 16.927143 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| w516 | 1.0 | 1.0 | 0.0 | 1.0 | 24.00 | 2248.3118 | 0.261414 | 0.059394 | 0.000816 | -0.530 | ... | 0.288342 | 0.288342 | 0.760618 | 0.937191 | 1.231097 | 1.485550 | 1.736844 | 2.003761 | 0.0 | 8.751429 |
| w517 | 1.0 | 0.0 | 0.0 | 1.0 | -58.68 | 1452.6522 | 0.245455 | 0.040808 | 0.003776 | -1.650 | ... | 0.165443 | 0.165443 | 0.695993 | 1.115882 | 1.539055 | 1.953655 | 2.340578 | 2.683975 | 0.0 | 7.587143 |
| w518 | 1.0 | 0.0 | 0.0 | 1.0 | -27.39 | 1695.3341 | 0.216364 | -0.085051 | 0.000765 | -1.705 | ... | 0.288342 | 0.288342 | 0.545824 | 1.095106 | 1.520087 | 1.927013 | 2.302943 | 2.600988 | 0.0 | 8.987143 |
| w519 | 1.0 | 0.0 | 0.0 | 1.0 | 93.73 | 2043.0013 | 0.261414 | -0.022020 | -0.002653 | 1.225 | ... | 0.288342 | 0.288342 | 0.695993 | 1.194533 | 1.743452 | 2.231419 | 2.698119 | 3.120640 | 0.0 | 7.332857 |
| w520 | 1.0 | 0.0 | 0.0 | 1.0 | 85.86 | 586.0924 | 0.308750 | 0.290625 | 0.002097 | 2.210 | ... | 0.443757 | 0.659872 | 0.871781 | 0.919907 | 1.216203 | 1.381154 | 1.549826 | 1.669296 | 0.0 | 7.915714 |
1057 rows × 3132 columns
The resulting feature matrix has 1057 rows and 3132 columns. In other words, 3132 features have been extracted over 1057 windows.
Activity Recognition Model¶
Denote running as true and walking as false.
target = pd.Series(extracted_features.index.str[0] == 'r', index=extracted_features.index)
target
r000 True
r001 True
r002 True
r003 True
r004 True
...
w516 False
w517 False
w518 False
w519 False
w520 False
Length: 1057, dtype: bool
Feature Selection¶
select = FeatureSelector()
select.fit(extracted_features, target)
FeatureSelector()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
FeatureSelector()
Feature Reduction¶
select.transform(extracted_features)
| az__variance_larger_than_standard_deviation | ax__number_crossing_m__m_-1 | az__ratio_value_number_to_time_series_length | az__absolute_maximum | az__maximum | ay__maximum | atotal__maximum | atotal__absolute_maximum | az__quantile__q_0.8 | az__quantile__q_0.1 | ... | atotal__cwt_coefficients__coeff_6__w_2__widths_(2, 5, 10, 20) | ay__cwt_coefficients__coeff_8__w_2__widths_(2, 5, 10, 20) | ay__cwt_coefficients__coeff_1__w_10__widths_(2, 5, 10, 20) | atotal__fft_coefficient__attr_"angle"__coeff_17 | atotal__change_quantiles__f_agg_"mean"__isabs_False__qh_1.0__ql_0.8 | atotal__ar_coefficient__coeff_9__k_10 | ax__friedrich_coefficients__coeff_0__m_3__r_30 | az__fft_coefficient__attr_"angle"__coeff_43 | ax__fft_coefficient__attr_"angle"__coeff_1 | ay__has_duplicate_max | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r000 | 1.0 | 6.0 | 0.940000 | 11.20 | 11.20 | 7.58 | 13.14 | 13.14 | 5.762 | -3.880 | ... | -6.353659 | -0.175893 | 1.941404 | 89.845029 | 0.065333 | 0.109936 | -0.000966 | -18.856243 | 9.153098 | 0.0 |
| r001 | 1.0 | 7.0 | 0.990000 | 14.89 | 14.89 | 9.43 | 29.18 | 29.18 | 3.218 | -3.976 | ... | 1.133600 | 0.410486 | 8.307796 | -17.953434 | -0.274706 | -0.151449 | -0.000304 | -12.548361 | 153.895776 | 0.0 |
| r002 | 1.0 | 5.0 | 0.950000 | 10.66 | 10.66 | 6.44 | 17.33 | 17.33 | 5.576 | -4.829 | ... | 0.461592 | 4.009567 | 7.893092 | 17.612480 | 0.116250 | 0.937414 | 0.000122 | 116.239548 | -38.470514 | 0.0 |
| r003 | 1.0 | 6.0 | 0.970000 | 10.30 | 10.30 | 6.49 | 16.28 | 16.28 | 2.810 | -5.117 | ... | 1.495860 | -0.861225 | -7.231616 | 145.591365 | 0.087500 | -0.031573 | -0.001296 | -5.472196 | -143.192616 | 0.0 |
| r004 | 1.0 | 5.0 | 0.960000 | 12.20 | 12.20 | 8.19 | 19.07 | 19.07 | 6.310 | -4.689 | ... | -1.133400 | 2.478037 | -3.543754 | 17.552563 | -0.113125 | -0.041570 | -0.000166 | 127.388472 | 33.136497 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| w516 | 0.0 | 1.0 | 0.840000 | 2.00 | 1.57 | 4.31 | 9.03 | 9.03 | 1.060 | -1.020 | ... | 0.412208 | -0.172704 | 0.078933 | 140.826072 | -0.010000 | 0.127181 | -0.000585 | -8.058655 | 81.531072 | 0.0 |
| w517 | 0.0 | 2.0 | 0.880000 | 2.35 | 2.35 | 1.36 | 8.35 | 8.35 | 1.244 | -0.946 | ... | 0.155062 | -0.711380 | 3.854368 | 101.681248 | 0.090556 | 0.137142 | 0.000878 | 159.152159 | 18.323403 | 0.0 |
| w518 | 0.0 | 2.0 | 0.810000 | 1.64 | 1.64 | 3.64 | 9.24 | 9.24 | 0.896 | -0.740 | ... | 1.370432 | 0.728550 | 2.459767 | -58.089717 | -0.118333 | 0.367404 | 0.000721 | -4.611620 | -46.002433 | 0.0 |
| w519 | 0.0 | 1.0 | 0.720000 | 2.32 | 1.86 | 3.75 | 7.51 | 7.51 | 1.010 | -0.250 | ... | 0.023910 | 1.001787 | -2.684173 | 89.093543 | -0.020000 | 0.121748 | -0.000866 | -10.821067 | -92.594920 | 0.0 |
| w520 | 0.0 | 1.0 | 0.939394 | 2.50 | 2.50 | 3.15 | 8.74 | 8.74 | 0.900 | -0.796 | ... | 1.611997 | -0.235135 | -7.203606 | -0.722175 | 0.260000 | 0.647203 | -0.000445 | 1.482437 | 90.135916 | 0.0 |
1057 rows × 1202 columns
p_values_top5 = select.p_values[:5]
p_values_top5
array([5.00047690e-238, 2.38755945e-177, 2.74217433e-174, 3.20327653e-174,
3.20534176e-174])
relevant_features_top5 = select.relevant_features[:5]
relevant_features_top5
['az__variance_larger_than_standard_deviation', 'ax__number_crossing_m__m_-1', 'az__ratio_value_number_to_time_series_length', 'az__absolute_maximum', 'az__maximum']
5 Features with the smallest p-values¶
relevant_feats_p_value_top5 = pd.DataFrame({"Feature": relevant_features_top5, "p-values": p_values_top5})
relevant_feats_p_value_top5.to_csv(data_dir / "interim" / "relevant_feats_p_value_top5.csv", index=False)
relevant_feats_p_value_top5
| Feature | p-values | |
|---|---|---|
| 0 | az__variance_larger_than_standard_deviation | 5.000477e-238 |
| 1 | ax__number_crossing_m__m_-1 | 2.387559e-177 |
| 2 | az__ratio_value_number_to_time_series_length | 2.742174e-174 |
| 3 | az__absolute_maximum | 3.203277e-174 |
| 4 | az__maximum | 3.205342e-174 |
relevant_features_top5_df = pd.concat([extracted_features[relevant_features_top5], pd.DataFrame(target, columns=["target"])], axis=1)
relevant_features_top5_df
| az__variance_larger_than_standard_deviation | ax__number_crossing_m__m_-1 | az__ratio_value_number_to_time_series_length | az__absolute_maximum | az__maximum | target | |
|---|---|---|---|---|---|---|
| r000 | 1.0 | 6.0 | 0.940000 | 11.20 | 11.20 | True |
| r001 | 1.0 | 7.0 | 0.990000 | 14.89 | 14.89 | True |
| r002 | 1.0 | 5.0 | 0.950000 | 10.66 | 10.66 | True |
| r003 | 1.0 | 6.0 | 0.970000 | 10.30 | 10.30 | True |
| r004 | 1.0 | 5.0 | 0.960000 | 12.20 | 12.20 | True |
| ... | ... | ... | ... | ... | ... | ... |
| w516 | 0.0 | 1.0 | 0.840000 | 2.00 | 1.57 | False |
| w517 | 0.0 | 2.0 | 0.880000 | 2.35 | 2.35 | False |
| w518 | 0.0 | 2.0 | 0.810000 | 1.64 | 1.64 | False |
| w519 | 0.0 | 1.0 | 0.720000 | 2.32 | 1.86 | False |
| w520 | 0.0 | 1.0 | 0.939394 | 2.50 | 2.50 | False |
1057 rows × 6 columns
Visualisation of the 5 Features with the smallest p-value¶
sns.pairplot(relevant_features_top5_df, hue='target')
plt.savefig(fig_dir / 'five_smallest_p_value_pairplot_fig.png')
plt.show()
The distributions on the diagonal entries of the figure show that these features are good choices for classifying between walking and running due to the lack of overlap. This is further supported by the scatter plots, which also illustrate little to no overlap
all_relevant_features_df = extracted_features[select.relevant_features]
all_relevant_features_df.to_csv(data_dir / "interim" / "all_relevant_features.csv", index=False)
all_relevant_features_df
| az__variance_larger_than_standard_deviation | ax__number_crossing_m__m_-1 | az__ratio_value_number_to_time_series_length | az__absolute_maximum | az__maximum | ay__maximum | atotal__maximum | atotal__absolute_maximum | az__quantile__q_0.8 | az__quantile__q_0.1 | ... | atotal__cwt_coefficients__coeff_6__w_2__widths_(2, 5, 10, 20) | ay__cwt_coefficients__coeff_8__w_2__widths_(2, 5, 10, 20) | ay__cwt_coefficients__coeff_1__w_10__widths_(2, 5, 10, 20) | atotal__fft_coefficient__attr_"angle"__coeff_17 | atotal__change_quantiles__f_agg_"mean"__isabs_False__qh_1.0__ql_0.8 | atotal__ar_coefficient__coeff_9__k_10 | ax__friedrich_coefficients__coeff_0__m_3__r_30 | az__fft_coefficient__attr_"angle"__coeff_43 | ax__fft_coefficient__attr_"angle"__coeff_1 | ay__has_duplicate_max | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r000 | 1.0 | 6.0 | 0.940000 | 11.20 | 11.20 | 7.58 | 13.14 | 13.14 | 5.762 | -3.880 | ... | -6.353659 | -0.175893 | 1.941404 | 89.845029 | 0.065333 | 0.109936 | -0.000966 | -18.856243 | 9.153098 | 0.0 |
| r001 | 1.0 | 7.0 | 0.990000 | 14.89 | 14.89 | 9.43 | 29.18 | 29.18 | 3.218 | -3.976 | ... | 1.133600 | 0.410486 | 8.307796 | -17.953434 | -0.274706 | -0.151449 | -0.000304 | -12.548361 | 153.895776 | 0.0 |
| r002 | 1.0 | 5.0 | 0.950000 | 10.66 | 10.66 | 6.44 | 17.33 | 17.33 | 5.576 | -4.829 | ... | 0.461592 | 4.009567 | 7.893092 | 17.612480 | 0.116250 | 0.937414 | 0.000122 | 116.239548 | -38.470514 | 0.0 |
| r003 | 1.0 | 6.0 | 0.970000 | 10.30 | 10.30 | 6.49 | 16.28 | 16.28 | 2.810 | -5.117 | ... | 1.495860 | -0.861225 | -7.231616 | 145.591365 | 0.087500 | -0.031573 | -0.001296 | -5.472196 | -143.192616 | 0.0 |
| r004 | 1.0 | 5.0 | 0.960000 | 12.20 | 12.20 | 8.19 | 19.07 | 19.07 | 6.310 | -4.689 | ... | -1.133400 | 2.478037 | -3.543754 | 17.552563 | -0.113125 | -0.041570 | -0.000166 | 127.388472 | 33.136497 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| w516 | 0.0 | 1.0 | 0.840000 | 2.00 | 1.57 | 4.31 | 9.03 | 9.03 | 1.060 | -1.020 | ... | 0.412208 | -0.172704 | 0.078933 | 140.826072 | -0.010000 | 0.127181 | -0.000585 | -8.058655 | 81.531072 | 0.0 |
| w517 | 0.0 | 2.0 | 0.880000 | 2.35 | 2.35 | 1.36 | 8.35 | 8.35 | 1.244 | -0.946 | ... | 0.155062 | -0.711380 | 3.854368 | 101.681248 | 0.090556 | 0.137142 | 0.000878 | 159.152159 | 18.323403 | 0.0 |
| w518 | 0.0 | 2.0 | 0.810000 | 1.64 | 1.64 | 3.64 | 9.24 | 9.24 | 0.896 | -0.740 | ... | 1.370432 | 0.728550 | 2.459767 | -58.089717 | -0.118333 | 0.367404 | 0.000721 | -4.611620 | -46.002433 | 0.0 |
| w519 | 0.0 | 1.0 | 0.720000 | 2.32 | 1.86 | 3.75 | 7.51 | 7.51 | 1.010 | -0.250 | ... | 0.023910 | 1.001787 | -2.684173 | 89.093543 | -0.020000 | 0.121748 | -0.000866 | -10.821067 | -92.594920 | 0.0 |
| w520 | 0.0 | 1.0 | 0.939394 | 2.50 | 2.50 | 3.15 | 8.74 | 8.74 | 0.900 | -0.796 | ... | 1.611997 | -0.235135 | -7.203606 | -0.722175 | 0.260000 | 0.647203 | -0.000445 | 1.482437 | 90.135916 | 0.0 |
1057 rows × 1202 columns
Fitting Random Forest Classifier and cross-validation¶
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42) # Perform 10-fold cv 10 times.
clf = RandomForestClassifier()
scores = cross_val_score(clf, all_relevant_features_df, target, cv=cv)
scores
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
The scores array indicates that all 100 training and testing folds from the 10-times repeated 10-fold cross-validation achieved 100% prediction accuracy.
Fitting on all data¶
clf2 = RandomForestClassifier()
clf2.fit(all_relevant_features_df, target)
with open(model_dir / 'final_model.pkl','wb') as f:
pickle.dump(clf, f)
importances = clf2.feature_importances_
feature_importance_df = pd.DataFrame({
'Feature': all_relevant_features_df.columns,
'Importance': importances
})
feature_importance_df
| Feature | Importance | |
|---|---|---|
| 0 | az__variance_larger_than_standard_deviation | 0.00 |
| 1 | ax__number_crossing_m__m_-1 | 0.00 |
| 2 | az__ratio_value_number_to_time_series_length | 0.00 |
| 3 | az__absolute_maximum | 0.03 |
| 4 | az__maximum | 0.00 |
| ... | ... | ... |
| 1197 | atotal__ar_coefficient__coeff_9__k_10 | 0.00 |
| 1198 | ax__friedrich_coefficients__coeff_0__m_3__r_30 | 0.00 |
| 1199 | az__fft_coefficient__attr_"angle"__coeff_43 | 0.00 |
| 1200 | ax__fft_coefficient__attr_"angle"__coeff_1 | 0.00 |
| 1201 | ay__has_duplicate_max | 0.00 |
1202 rows × 2 columns
5 Most Important Features Identified by feature_importances_¶
feature_importance_sorted_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_sorted_df.to_csv(data_dir / "interim" / "feature_importance_sorted.csv", index=False)
feature_importance_sorted_df.head(5)
| Feature | Importance | |
|---|---|---|
| 111 | atotal__change_quantiles__f_agg_"var"__isabs_T... | 0.07 |
| 65 | ax__change_quantiles__f_agg_"var"__isabs_True_... | 0.04 |
| 14 | ax__change_quantiles__f_agg_"mean"__isabs_True... | 0.03 |
| 27 | atotal__change_quantiles__f_agg_"mean"__isabs_... | 0.03 |
| 3 | az__absolute_maximum | 0.03 |
The 5 most important features:
- change_quantiles (atotal, ax, ax, atotal): Finds points in the time series less than ql and greater than qh, and computes the difference between adjacent values (absolute diference if isabs=True) that fall outside this corridor, then aggregates using the function defined for the f_agg parameter.
- maximum (az): The absolute maximum value of the time series.
all_relevant_features_with_labels = pd.concat([extracted_features[feature_importance_sorted_df['Feature'].iloc[:5]], pd.DataFrame(target, columns=['target'])], axis=1)
all_relevant_features_with_labels.to_csv(data_dir / "processed" / "all_relevant_features_with_labels.csv", index=False)
all_relevant_features_with_labels
| atotal__change_quantiles__f_agg_"var"__isabs_True__qh_0.8__ql_0.0 | ax__change_quantiles__f_agg_"var"__isabs_True__qh_1.0__ql_0.0 | ax__change_quantiles__f_agg_"mean"__isabs_True__qh_1.0__ql_0.0 | atotal__change_quantiles__f_agg_"mean"__isabs_True__qh_0.6__ql_0.0 | az__absolute_maximum | target | |
|---|---|---|---|---|---|---|
| r000 | 0.398618 | 0.255426 | 0.681111 | 0.812642 | 11.20 | True |
| r001 | 0.913306 | 1.991750 | 1.277879 | 0.942778 | 14.89 | True |
| r002 | 0.506248 | 0.774904 | 0.859192 | 0.862727 | 10.66 | True |
| r003 | 0.497177 | 0.762527 | 0.776869 | 0.850545 | 10.30 | True |
| r004 | 0.544153 | 0.565282 | 0.916970 | 0.819091 | 12.20 | True |
| ... | ... | ... | ... | ... | ... | ... |
| w516 | 0.047892 | 0.022006 | 0.261414 | 0.334737 | 2.00 | False |
| w517 | 0.019968 | 0.021154 | 0.245455 | 0.209310 | 2.35 | False |
| w518 | 0.018304 | 0.021092 | 0.216364 | 0.163860 | 1.64 | False |
| w519 | 0.069663 | 0.049792 | 0.261414 | 0.291636 | 2.32 | False |
| w520 | 0.038102 | 0.028880 | 0.308750 | 0.352105 | 2.50 | False |
1057 rows × 6 columns
sns.pairplot(all_relevant_features_with_labels, hue='target')
plt.savefig(fig_dir / 'five_most_important_feats_pairplot_fig.png')
plt.show()
When comparing the walking and running distributions, the difference is quite apparent. There is essentially no overlap between the two distributions on the diagonal entries, which makes these excellent features for classification. This is further supported by the scatter plots, which also show a clear separation between walking and running activities.